Large-scale whole-exome sequencing analyses identified protein-coding variants associated with immune-mediated diseases in 350,770 adults

The genetic contribution of protein-coding variants to immune-mediated diseases (IMDs) remains underexplored. Through whole exome sequencing of 40 IMDs in 350,770 UK Biobank participants, we identified 162 unique genes in 35 IMDs, among which 124 were novel genes. Several genes, including FLG which is associated with atopic dermatitis and asthma, showed converging evidence from both rare and common variants. 91 genes exerted significant effects on longitudinal outcomes (interquartile range of Hazard Ratio: 1.12-5.89). Mendelian randomization identified five causal genes, of which four were approved drug targets (CDSN, DDR1, LTA, and IL18BP). Proteomic analysis indicated that mutations associated with specific IMDs might also affect protein expression in other IMDs. For example, DXO (celiac disease-related gene) and PSMB9 (alopecia areata-related gene) could modulate CDSN (autoimmune hypothyroidism-, psoriasis-, asthma-, and Graves’ disease-related gene) expression. Identified genes predominantly impact immune and biochemical processes, and can be clustered into pathways of immune-related, urate metabolism, and antigen processing. Our findings identified protein-coding variants which are the key to IMDs pathogenesis and provided new insights into tailored innovative therapies.

Reviewer #1 (Remarks to the Author): In this paper the authors describe the results of very comprehensive analysis from whole exome sequencing material containing 350770 UK Biobank participants. The aim was to identify the putative variants, especially predicted loss-of-function and predicted deleterious missense, across immune-mediated diseases and elucidate their clinical impacts in relation to biochemical processes. The study identified 162 unique genes across 35 immune-mediated diseases with 124 being novel. The results show a lot of interesting and important new information of the genes and their rare and common, risk and protective SNP variants involved in immune-mediated diseases. The data showing that mutations influenced disease onset, causal evidence of gene-disease associations, shared protein expression in different immune-mediated diseases and validation analysis with an independent exome-sequenced cohort are very important information for the clinical and research community working with immune-mediated disease. For a researcher, the extensive data tables may provide a welcome possibility to evaluate the meaning of not well-known gene variants on specific immune-mediated diseases. In addition to aforementioned aspects I have some specific comments:
- Nowadays, instead of using Caucasian individuals, you should use European-descent. In my mind this kind of paper needs to describe more precisely the cohort used for the study, especially when we know that the frequencies of gene variants related to immune-mediated diseases varies between populations as is shown in this paper between study and validation cohorts.

Reviewer #2 (Remarks to the Author): Yang et al. uses data from the UK Biobank to perform gene-level association with immune associated diseases. Methodological choices in the identification of people with specific diseases, association analyses, and linking non-coding variants to the genes that they regulate limit this reviewer's enthusiasm.

Major concerns:
• The algorithms used to identify the immune associated diseases need to be more clearly presented and referenced for their effectiveness.
• The manuscript starts with a gene-based assessment. One of the findings presented as being consistent with prior literature is FLG with atopic dermatitis - but the reported effect size is much less than previous studies that focus on loss of function variants in FLG.
• ANNOVAR was published in 2010 and was a good choice for years. In 2024, using 500 kb windows to link DNA with genes is not appropriate. Methods to consider include eQTLs, ABC, and similar approaches that use functional genomic data to match DNA to the gene that they regulate.
• The Replication studies were presented such that p-values less than 0.05 were considered replication. This level of replication can be expected by chance given the number of association analyses. Permutation testing is needed to identify if the signals did truly replicate more than expected by chance.
• The UK Biobank has been used by countless studies that focus on immune-related diseases. It is critical for this manuscript to clearly identify how this study relates to, replicates, and builds upon at least some of these studies including by identifying methodological differences.

Minor
• It is unclear what is meant on line 73 "comprehensive and solid WES studies". Consider rewording.
• The term "Caucasian" is problematic and should be removed throughout. Consider following updated guidelines around ancestry and race when naming groups of people.
• The term "risky pattern" (e.g. line 107) should be replaced with more precise terminologies.

Reviewer #3 (Remarks to the Author): The authors present a large, systematic study of both rare and common coding variant effects on 35 immune mediated diseases (IMDs). By doing so, they discover novel disease-gene relationships and also characterise pervasive pleiotropy present amongst IMDs and other traits and biomarkers.
Overall, I think the authors' results and conclusions are sound. I have several minor comments that should help increase the clarity of the presented work:
1. The paper is poorly written and has many different grammatical mistakes. For example, L56-57, L59, L107, L140, L249, just to name a few. The authors should work with the editor to ensure that this is systematically addressed before publication.
2. It was unclear to me how P < 0.02 (FDR) was determined? What P-value threshold does this come to, and how was it deduced?
3. Are the authors correcting for the fact they have performed 35 exome-wide association analyses? For example in the Finngen analysis, a replication P-value of 0.05 is used but the author should present replication at P < 0.05/number_of_traits_analysed.
4. Do the authors adjust for sex, age or BMI in the exome associations? IMDs are very sex specific both in terms of incidence and genetics, the authors should adjust for sex or ideally, carry out sex-specific analyses.
5. How do the authors intend to share their results with the community? Is there a website that allows researchers to query these results and download summary statistics? I insist for all reviews that both code (github) and summary level findings are made publicly available.

Responses to the Reviewers
The authors sincerely appreciate the critical reviews of the paper and the helpful way in which the reviewing editors put together a constructive list of suggestions for its revision. We have now revised the paper to carefully address all the points raised. Our responses below are preceded by "-", and changes made to the paper are shown below within "…", and in red font in the revised paper.

For a researcher, the extensive data tables may provide a welcome possibility to evaluate the meaning of not well-known gene variants on specific immune-mediated diseases. In addition to aforementioned aspects I have some specific comments:
1. Nowadays, instead of using Caucasian individuals, you should use European-descent. In my mind this kind of paper needs to describe more precisely the cohort used for the study, especially when we know that the frequencies of gene variants related to immune-mediated diseases varies between populations as is shown in this paper between study and validation cohorts.

Response:
- Many thanks for your positive feedback on our article. Following your suggestion and the updated guidelines around ancestry and race, we have changed the term "Caucasian" to "European-descent" in the revised manuscript.
2. Please describe the control population.

Response:
- Thanks for your rigorous concerns. We defined participants with IMDs as cases, and all others as controls. Case status was determined at last follow-up. The IMD phenotypes were ascertained and classified according to the in- and out-patient International Classification of Diseases, Ninth Revision (ICD-9) and Tenth Revision (ICD-10) codes and Read codes.
- We have added some details in the revised Methods section, Page 23, Lines 472-474: "We defined participants with IMDs as cases, otherwise as controls. Case status was determined at last follow-up."
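
The case definition described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the authors' pipeline: the participant records, the `classify_participants` helper and the code prefix are hypothetical, and the actual diagnosis codes are those listed in the authors' supplementary table.

```python
from typing import Dict, Iterable, Set


def classify_participants(
    records: Dict[str, Iterable[str]], disease_prefixes: Set[str]
) -> Dict[str, str]:
    """Label each participant as 'case' or 'control' for one disease.

    records maps a participant ID to all diagnosis codes recorded up to
    last follow-up (hypothetical structure). A participant is a case if
    any code starts with one of the disease prefixes, otherwise a control.
    """
    status = {}
    for participant_id, codes in records.items():
        is_case = any(
            code.startswith(prefix)
            for code in codes
            for prefix in disease_prefixes
        )
        status[participant_id] = "case" if is_case else "control"
    return status


# Hypothetical records; "L20" is used only as an illustrative ICD-10 prefix.
example_records = {
    "p1": ["L20.9", "I10"],
    "p2": ["I10"],
    "p3": [],
}
print(classify_participants(example_records, {"L20"}))
```

Participants with no matching code, including those with no recorded codes at all, fall into the control group, which mirrors the "otherwise as controls" wording.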

Response:
- Thanks for your rigorous consideration. We apologize for the spelling mistake and have corrected the spelling to "deleterious".

Response:
- We apologize for the spelling mistake and have corrected the spelling to "asthma".

Response:
- We apologize for the spelling mistake and have corrected the spelling to "asthma".

Response:
- Thanks for pointing out this issue. We are sorry that we did not add the figure numbers at the end of the document. We have added "Figure 1" to "Figure 6" beneath the corresponding figures. We are not sure whether we have understood your comment correctly; please feel free to contact us if you have any further comments.

Reviewer #2 (Remarks to the Author): Yang et al. uses data from the UK Biobank to perform gene-level association with immune associated diseases. Methodological choices in the identification of people with specific diseases, association analyses, and linking non-coding variants to the genes that they regulate limit this reviewer's enthusiasm.
Major concerns:
1. The algorithms used to identify the immune associated diseases need to be more clearly presented and referenced for their effectiveness.

Response:
- Thanks for your rigorous concerns. The IMD phenotypes were ascertained and classified according to the in- and out-patient International Classification of Diseases, Ninth Revision (ICD-9) and Tenth Revision (ICD-10) 1 codes and Read codes [2][3][4][5][6]. Data sources include the UKB first occurrences of health outcomes (Category 1712), hospital admission data (Field IDs 41270 and 41271) and primary care data (Category 3000); self-reported cases were excluded. The diagnosis codes for IMDs are provided in Supplementary Table 26.

2. The manuscript starts with a gene-based assessment. One of the findings presented as being consistent with prior literature is FLG with atopic dermatitis - but the reported effect size is much less than previous studies that focus on loss of function variants in FLG.

Response:
- We greatly appreciate your astute observations and valuable insights.We acknowledge that there are two main reasons for the difference in effect sizes between our study and previous studies: the population characteristics of UKB and the intrinsic properties of analytical models we used.
- We have provided the potential reasons in the Discussion section, Page 21, Lines 439-446: "Third, the UK Biobank population may reflect volunteer bias and survivor bias with a sample of healthier individuals than the general UK population 2, and therefore may show lower frequency of putative pathogenic variants and lower penetrance. Fourth, some effect sizes for identified genes from collapsing analysis are lower than expected from previous studies. We hypothesized that it was primarily due to usage of saddle point-approximation-corrected logistic mixed-model approach implemented in SAIGE-GENE+ software, which might yield slightly conservative effect estimates, particularly when assessing significance for binary traits with imbalanced case-control ratios 2."

3. ANNOVAR was published in 2010 and was a good choice for years. In 2024, using 500 kb windows to link DNA with genes is not appropriate. Methods to consider include eQTLs, ABC, and similar approaches that use functional genomic data to match DNA to the gene that they regulate.

Response:
- Thanks for pointing out this issue. Following your suggestion, to match DNA to the genes it regulates, we combined positional mapping with eQTL and chromatin interaction mapping using FUMA (https://fuma.ctglab.nl/). A total of 489 genes were mapped by position, eQTL or chromatin interaction mapping, and 109 genes (94.8% overlapped with ANNOVAR) were consistently mapped. Therefore, we believe that the results of the two annotation strategies (positional annotation and functional annotation) do not differ substantially.
- We have addressed this issue in Supplementary Table 7 and in the revised manuscript, Page 9, Lines 167-173: Results - "Different from positional annotation conducted by ANNOVAR, we also combined positional mapping with eQTL and chromatin interaction mapping by FUMA (https://fuma.ctglab.nl/) to find the gene that they regulate. A total of 489 genes were mapped by positional or eQTL or chromatin interaction mapping and 109 genes were consistently mapped (94.8% overlapped with ANNOVAR; Supplementary Table 7-8)."

Response:
- Thanks for your rigorous concerns. We first apologize for our ambiguous wording. In our study, to validate our discerned associations in an independent cohort, we leveraged summary statistics from the FinnGen Consortium online results (version 8) 1. The summary statistics are publicly available online (https://r8.finngen.fi/). We searched for the coefficient and p-value of each gene identified in our gene-disease associations and selected the strongest association (i.e., lowest p-value) per gene. For common mutations, we obtained the coefficient and p-value of the corresponding variant, if available.
- We now acknowledge that our primary focus was not framed as a replication but rather aimed at providing evidence to support our results. We have clarified this in the Revised  27.").
References:
1 Kurki, M.I., Karjalainen, J., Palta, P. et al. FinnGen provides genetic insights from a well-phenotyped isolated population. Nature 613, 508-518 (2023).
5. The replication studies were presented such that p-values less than 0.05 were considered replication. This level of replication can be expected by chance given the number of association analyses. Permutation testing is needed to identify if the signals did truly replicate more than expected by chance.

Response:
- Thank you for your rigorous consideration. First, to be clear again, we leveraged summary statistics from the FinnGen Consortium (version 8) 1 to validate our discerned associations. Unfortunately, permutation testing cannot be conducted with summary-level statistics.
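
The arithmetic behind this exchange can be sketched as follows. The snippet is illustrative only and is not the authors' code: it computes how many nominal hits (P < 0.05) would be expected by chance for a given number of tests, and the Bonferroni threshold for 35 IMDs quoted from the revised Methods. Using 162 as the number of tests is an assumption based on the number of genes reported, and the calculation assumes independent tests with no true associations.

```python
def expected_chance_hits(n_tests: int, alpha: float = 0.05) -> float:
    """Expected number of tests passing alpha when every null is true."""
    return n_tests * alpha


def bonferroni_threshold(alpha: float, n_traits: int) -> float:
    """Per-test significance threshold after Bonferroni correction."""
    return alpha / n_traits


n_genes = 162  # assumption: one validation test per reported gene
n_imds = 35    # number of IMDs used for the correction in the revised Methods

print(f"{expected_chance_hits(n_genes):.1f}")          # 8.1
print(f"{bonferroni_threshold(0.05, n_imds):.2e}")     # 1.43e-03
```

Under these assumptions, roughly eight nominal hits would be expected by chance alone, which is why a per-trait corrected threshold is more informative than P < 0.05.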
6. The UK Biobank has been used by countless studies that focus on immune-related diseases. It is critical for this manuscript to clearly identify how this study relates to, replicates, and builds upon at least some of these studies including by identifying methodological differences.

Response:
- Thank you for your kind comments! Most studies of IMDs in the UKB have focused on genetic risk loci, ethnic distribution, identification of therapeutic targets, comorbidities, and shared or distinct genetic components. However, the majority of these studies are based on GWAS, and our literature search found only a few WES studies of IMDs in the UKB. Notably, our study is not only a methodological innovation and extension of previous approaches to identifying IMD susceptibility genes/SNPs (through whole exome sequencing), but also the most comprehensive and extensive WES study of IMDs (40 IMDs in 350,770 UK Biobank participants) to date. To clarify, we have added some details in the Introduction and Discussion sections.
- Page 5, Lines 78-87: "The UK Biobank (UKB), rich in multi-omic data, stands as an ideal platform for such endeavors, and has been widely used for sequencing studies of human diseases and traits. Prior investigations within the UKB, primarily through GWAS, have focused on genetic risk factors 1,2, ethnic disparities 3, identification of IMDs therapeutic targets [4][5][6], associated comorbidities 7,8, and their unique or overlapping genetic landscape 9. However, WES studies of IMDs in the UKB have been sparse. Some have targeted specific regions, such as the HLA region for 11 autoimmune diseases 10, or have investigated asthma risk mutations among predetermined variants 11. Additionally, many of these studies were constrained by their sample sizes; for instance, one identified the TET2 mutation as a risk factor for gout among only 170,000 participants 12."
- Page 20, Lines 421-426: "The major strength of this study lies in not only the systematic investigations of protein coding variants in a series of 40 IMDs, but also the identification of a list of novel genes and exploration of their biological relevance. By virtue of focusing on coding variants, the observed associations more often provide a direct causal link between variants in a gene and a specific IMD 13, having implications for identifying or validating drug targets."

- Please describe the control population.
- Page 4 (72): deleterious
- Page 6 (108): asthma
- Page 7 (126): asthma
- Page 44: Fig1 miss the letters A, B, C, D
- Figures miss numbers

-scale whole exome sequencing of immune-mediated diseases in 350770 adults
Corresponding Author: Jin-Tai Yu

6. Page 44: Fig1 miss the letters A, B, C, D

Response:
- Thanks for your rigorous consideration. We have added the letters A, B, C and D in the revised Figure 1 and the related figure legends.

Manuscript and deleted redundant information (Page 7, Lines 129-133: Results - "In order to validate the gene-based associations in UKB, we searched from Kurki et al.'s summary statistics analyzed from the FinnGen dataset." Page 27-28, Lines 573-598: Methods - "To validate our discerned associations in an independent cohort, we leveraged summary statistics from the FinnGen Consortium online results (version 8) 1. … The summary statistics were publicly available online (see data availability). Genotype-, sample- and variant-wise quality control and filtering procedures can be found in previous study 1. … A Bonferroni-corrected threshold of P < 1.43×10⁻³ (0.05/35 IMDs) was considered to be supported. The precise diagnosis code, along with the case and control distribution for each phenotype, is delineated in Supplementary Table